---
title: Analysis
description: Here, we show some of our exploratory data analysis and the journey from our initial findings into our conclusion on the relationships between green space, race, age, and asthma hospitalization rates.
toc: false
featuredVideo:
featuredImage: https://images.ctfassets.net/cxgxgstp8r5d/entry-cm_583-image/475d84453bda342d613dac0c5fa2d5db/entry-cm_583-image.jpg
draft: false
---
The goal of this data analysis is to explore the relationship between asthma hospitalizations, as a measure of the human impact of air pollution, and the amount of green space in different California counties. We hope to learn whether higher levels of asthma hospitalization correlate with lower proportions of green space.
Although asthma hospitalization rates are not a perfect measure of air pollution, the two are strongly linked, and California county data on asthma hospitalization rates is publicly available. Within asthma hospitalization rates, we’ll look specifically at age group and race to assess how hospitalization rates vary across groups. The racial makeup of a county is often correlated with socioeconomic status, so we hope to examine whether areas with a greater POC (people of color) population have both higher hospitalization rates and less green space, two factors that can correlate with lower socioeconomic status. However, it’s important to note that race is certainly not an exact metric for socioeconomic status. Our observations may have implications for how the socioeconomic level of a county is linked to asthma hospitalizations or green space, but we will not conclusively define that correlation in this study.
To assess each county’s green space, we use the proportion of county area that is park land and the number of parks per county. We include both because they capture green space in two different ways, and considering area helps account for the fact that parks can differ drastically in size. Park land data is an imperfect measure of tree or plant coverage: urban parks can contain few plants, and rural regions can have large areas of plant coverage in private hands, which can improve air quality despite not being open to the public. However, parks data is publicly available and generally gives a good idea of the number of parks that most residents of a county should have access to, as well as the area these public green spaces cover. It would be helpful to have additional data that divides all land cover into percentages of grass cover, forest cover, building cover, and street cover, or similar categories, but that data was not available to us at this point.
Some of the major questions we are interested in answering include:

- Does a higher amount of open green spaces or a higher number of public parks correlate to lower asthma hospitalization rates across California counties?
- What are the differences in racial and age makeup of hospitalizations across counties?
- What is the relationship between the racial makeup of asthma hospitalizations and the amount of greenspace in a county?
- What is the relationship between age makeup of hospitalizations and greenspace?
- Is there a difference in the relationship between number of open parks and asthma hospitalization rate, and proportion of park land and asthma hospitalization rate? Which serves as a better predictor?
While exploring the data, we first examined how open green space and the number of parks differed across California counties. We focused on three variables: open park land, number of parks, and the proportion of park land to total county land. We found that although the typical census tract in most counties devotes only about 10% of its land to parks, many counties contain a large number of outlier tracts. Census tracts contain between 1,200 and 8,000 people each, so some specific areas within each county have much more open green space than is typical. We also looked generally at the relationship between open park land and age-adjusted hospitalization rate by county but did not see any major correlation.
First, we load our datasets and mutate them to add columns with summary statistics for each county (code not shown for brevity).
We begin by exploring the trends in open green spaces across California counties.
We create boxplots that show the proportion of park land to total land area for the census tracts within each county. Although the number of counties makes this a little hard to read, a few things stand out. First, the median percentage of park land for census tracts in most counties is only about 10%, but many counties have a large number of outliers. Since census tracts usually contain about 4,000 people (the range is 1,200 to 8,000), some areas must have much larger amounts of green space for relatively small numbers of people. Second, Los Angeles County (light blue) stands out for its sheer number of census tracts. It might be interesting to remove Los Angeles County from the data to see what the trends look like without it.
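To make the idea of an "outlier" tract concrete, here is a minimal sketch in base R. The vector `park_prop` and its values are made up for illustration and are not taken from our data. It uses `boxplot.stats()`, which applies the default boxplot rule of flagging values more than 1.5 times the interquartile range beyond the box; ggplot2 boxplots follow a very similar convention.

```r
# Hypothetical park-to-total-land proportions for the tracts in one county
# (illustrative values only, not taken from our dataset).
park_prop <- c(0.02, 0.05, 0.08, 0.10, 0.10, 0.12, 0.15, 0.60, 0.85)

# boxplot.stats() returns the five-number summary in $stats and the
# values lying more than 1.5 * IQR beyond the box in $out.
stats <- boxplot.stats(park_prop)

tract_median <- stats$stats[3]  # third element of $stats is the median
tract_outliers <- stats$out     # values a boxplot would draw as points

print(tract_median)
print(tract_outliers)
```

In this toy county the median tract is about 10% park land, while two tracts with far more park land are flagged as outliers, which mirrors the pattern described above.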

Next, we look at some of the overall trends across counties in open park area, number of asthma hospitalizations, and age-adjusted hospitalization rate.
Open Park Area in Census Tract, Organized by County and Colored by Hospitalization Totals
The plot below shows the total open park area for each census tract, colored by the tract’s number of hospitalizations, within each CA county. Counties are also ordered from least to most open park area, on average. Although there is no clear relationship between total park area and hospitalization numbers, the data points on the right side of the plot are generally dark purple (lowest levels of hospitalization), while the pinker points (higher hospitalization levels) are generally on the left side. In other words, the counties with the highest average open park area tend to have low numbers of hospitalizations.

Using the parks_asthmaCA_kids_v_adults data again, the plot below shows the total open park area within each county in square miles (the sum of the open park areas of all tracts in that county), colored by the county’s number of hospitalizations. This plot helps resolve some of the overplotting in the previous plot, since it uses one summary point per county rather than one point per tract.
It appears that Riverside County (the county farthest to the right, with the greatest open park area) has a relatively high number of hospitalizations (pink, 1e+05 to 2e+05) compared to the other points. This point may be an anomaly, as it does not appear to follow the general trend of the plot (most counties are colored purple, 0 to 1e+05). Another interesting point is El Dorado County, which appears in yellow near the center of the plot. Despite being near the middle of the range for open park area, El Dorado has the highest number of hospitalizations of the counties displayed, with more than 4e+05 hospitalizations.
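As a rough illustration of how tract-level rows can be collapsed into one summary point per county, here is a small base R sketch. The data frame `tracts`, its column names, and its values are hypothetical stand-ins, not our actual dataset or code.

```r
# Hypothetical tract-level rows (illustrative values; the column names
# are assumptions chosen to resemble the variables described above).
tracts <- data.frame(
  county_name = c("A", "A", "A", "B", "B"),
  open_park_area = c(1.5, 0.5, 2.0, 10.0, 5.0),  # square miles per tract
  number_hospitalizations = c(30, 20, 50, 40, 10)
)

# Sum both numeric columns within each county, giving one row per county.
county_totals <- aggregate(
  cbind(open_park_area, number_hospitalizations) ~ county_name,
  data = tracts,
  FUN = sum
)

print(county_totals)
```

Plotting `county_totals` instead of `tracts` gives one point per county, which is what removes the overplotting.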
In summary, from this plot there does not appear to be a distinct difference in number of asthma-related hospitalizations for counties that have lower versus higher square mileage of open park area.

Next, we’ll facet the data by age to get a sense of trends for each age group (kids ages 0-17 or adults 18+).
Faceted by Kids v. Adult: Age-Adjusted Hospitalization Rate by County and colored by County Open Park Area
Using the parks_asthmaCA_kids_v_adults dataset, this plot shows that in general the age-adjusted hospitalization rate by county (the rate of hospitalizations for a county, adjusted for the age group’s population in that county) is much higher for kids (0-17 years) than for adults (18+ years). As you can see from the plot, the highest hospitalization rate for adults is around 5, whereas the highest hospitalization rate for kids is above 15, and most counties have hospitalization rates for kids that are above 5. This tells us that kids generally have higher asthma hospitalization rates than adults across counties.
This plot also shows that Fresno County has the highest asthma hospitalization rate for both kids and adults, making it a county that we could analyze further. Riverside County, meanwhile, sticks out as having more than 3000 square miles of open park area, though its asthma hospitalization rate is relatively low compared to other counties. This puts our earlier plot into perspective. We noted that Riverside County had a relatively high number of hospitalizations despite its high open park area, but looking at its age-adjusted hospitalization rate instead of its number of hospitalizations clarifies that Riverside County is not as interesting a data point as we initially thought.
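The difference between a raw count and a rate can be seen in a small base R sketch. All numbers, names, and the per-10,000 scale below are assumptions chosen for illustration; a true age-adjusted rate would additionally weight age-specific rates by a standard population, which this sketch leaves out.

```r
# Hypothetical counties (made-up numbers, not from our dataset).
counties <- data.frame(
  county_name = c("Large", "Small"),
  hospitalizations = c(2000, 300),
  population = c(2000000, 100000)
)

# Crude rate per 10,000 residents (the scale is an assumption here).
# Age adjustment is deliberately omitted to keep the sketch short.
counties$rate_per_10k <-
  counties$hospitalizations / counties$population * 10000

print(counties)
```

The "Large" county has far more hospitalizations in total but a lower rate once population is taken into account, which is the same effect that makes Riverside County look less unusual in the rate plot.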
Lastly, this plot again confirms that there does not appear to be a distinct difference in asthma hospitalization rate for counties that have lower versus higher square mileage of open park area.

Next, we created maps (using a combination of the USAboundaries, sf, and tmap packages) to visualize the asthma hospitalization rates across counties for each race/ethnicity. Since there was hardly any data for the AI/AN (American Indian/Alaska Native) category, we do not include it in our maps. Note also that these maps were originally going to be featured in our interactive, but we ran into some issues with tmap and shiny. Since there are only three maps, we decided to feature them here instead, as they are still important visualizations of our data!
Map of California Counties and Age-Adjusted Hospitalization Rate for White Individuals

Map of California Counties and Age-Adjusted Hospitalization Rate for Hispanic Individuals

Map of California Counties and Age-Adjusted Hospitalization Rate for Black Individuals

As seen from the map legends, the age-adjusted hospitalization rates for Black individuals span a much higher range overall. The top legend ranges on the maps for White and Hispanic individuals are 5 to 6 and 6 to 8, whereas the top range for Black individuals is 30 to 40. These maps therefore highlight the health disparity for Black individuals in California, whose asthma hospitalization rates are disproportionately high compared to those of their White and Hispanic counterparts.
Additionally, across all of these maps, Fresno County (the yellow county in the center of the state) pops out as a county of interest yet again, as it has the highest asthma hospitalization rate for each race/ethnicity group. Imperial County (at the very bottom of the state) also appears to have a high asthma hospitalization rate compared to other counties.
```r
library(dplyr)    # %>% and filter()
library(ggplot2)
library(plotly)   # ggplotly()

# Count of open parks vs. total hospitalizations, one point per county
ggplotly(
  sums %>%
    ggplot(aes(num_open_parks, number_hospitalizations, col = county_name)) +
    geom_point() +
    labs(x = "Count of Open Parks", y = "Hospitalizations", col = "County Name",
         title = "Count of Open Parks vs Total Number of Hospitalizations per County")
)

# Count of open parks vs. age-adjusted hospitalization rate
ggplotly(
  sums %>%
    ggplot(aes(num_open_parks, age_adjusted_hospitalization_rate, col = county_name)) +
    geom_point() +
    labs(x = "Open Parks", y = "Hospitalization Rate", col = "County Name")
)

# Total open park area vs. hospitalizations, colored by hospitalization rate
ggplotly(
  sums %>%
    ggplot(aes(x = total_open_park_area, y = number_hospitalizations,
               col = age_adjusted_hospitalization_rate)) +
    geom_jitter() +
    labs(x = "Open Park Area", y = "Hospitalizations", col = "Hospitalization Rate")
)

# Same axes as above, colored by county instead
ggplotly(
  sums %>%
    ggplot(aes(x = total_open_park_area, y = number_hospitalizations, col = county_name)) +
    geom_jitter() +
    labs(x = "Open Park Area", y = "Hospitalizations", col = "County Name")
)

# Drop counties missing either variable before fitting the model
parks_asthmaCA_kids_v_adults_countystats <- parks_asthmaCA_kids_v_adults_countystats %>%
  filter(!is.na(county_number_hospitalizations), !is.na(county_open_park_area))

# Simple linear regression of hospitalizations on open park area
mod1 <- lm(county_number_hospitalizations ~ county_open_park_area,
           data = parks_asthmaCA_kids_v_adults_countystats)
beta <- coef(mod1)  # beta[1] is the intercept, beta[2] is the slope

# Scatterplot with the fitted regression line overlaid in red
parks_asthmaCA_kids_v_adults_countystats %>%
  ggplot(aes(x = county_open_park_area, y = county_number_hospitalizations)) +
  geom_point() +
  geom_abline(intercept = beta[1], slope = beta[2], color = "red")
```
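One possible way to approach the question of which green-space measure is the better predictor is to fit one simple linear model per measure and compare their R-squared values. The sketch below does this in base R on made-up data: the data frame `county_stats` and the columns `park_prop`, `num_parks`, and `hosp_rate` are hypothetical, and the result says nothing about our real data.

```r
# Hypothetical county-level data (illustrative values only).
county_stats <- data.frame(
  park_prop = c(0.05, 0.10, 0.15, 0.20, 0.25, 0.30),
  num_parks = c(40, 12, 55, 20, 35, 18),
  hosp_rate = c(9.0, 8.1, 7.2, 5.9, 5.1, 4.0)
)

# Fit one simple linear regression per candidate predictor.
mod_prop <- lm(hosp_rate ~ park_prop, data = county_stats)
mod_count <- lm(hosp_rate ~ num_parks, data = county_stats)

# R-squared: the share of variance in hosp_rate explained by each model.
r2_prop <- summary(mod_prop)$r.squared
r2_count <- summary(mod_count)$r.squared

print(c(park_prop = r2_prop, num_parks = r2_count))
```

With only a handful of counties, a comparison like this is fragile, so it is best read alongside the scatterplots rather than on its own.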
